Abstract: Deep neural networks, despite their great success in
feature learning for various computer vision tasks, are usually
considered impractical for online visual tracking because
they require very long training time and a large number of 
training samples. In this work, we present an efficient and very 
robust tracking algorithm using a single Convolutional Neural 
Network (CNN) for learning effective feature representations of 
the target object, in a purely online manner. Our contributions 
are multifold: First, we introduce a novel truncated structural 
loss function that maintains as many training samples as possible 
and reduces the risk of tracking error accumulation. Second, we 
enhance the ordinary Stochastic Gradient Descent approach in 
CNN training with a robust sample selection mechanism. The 
sampling mechanism randomly generates positive and negative
samples from different temporal distributions, which are
generated by taking the temporal relations and label noise into
account. Finally, a lazy yet effective updating scheme is designed
for CNN training. Equipped with this novel updating algorithm, 
the CNN model is robust to some long-standing difficulties in
visual tracking, such as occlusion and incorrect detections, without
losing its ability to adapt to significant appearance changes.
In our experiments, our CNN tracker outperforms all compared
state-of-the-art methods on two recently proposed benchmarks 
which in total involve over 60 video sequences. The remarkable 
performance improvement over the existing trackers illustrates 
the superiority of the feature representations which are learned 
purely online via the proposed deep learning framework. 

I. Introduction 

Image features play a crucial role in many challenging 
computer vision tasks such as object recognition and detection. 
Unfortunately, in many online visual trackers features are
manually defined and combined [1], [2], [3], [4]. Even though
these methods report satisfactory results on individual datasets,
hand-crafted feature representations limit tracking performance.
For instance, normalized cross correlation, which is
discriminative when the lighting condition is favourable, might
become ineffective when the object moves under shadow. This
necessitates good representation learning mechanisms for visual
tracking that are capable of capturing appearance changes
effectively over time.

Recently, deep neural networks have gained significant
attention thanks to their success in learning feature
representations. Unlike traditional hand-crafted features [5],
[6], [7], a multi-layer neural network architecture can
efficiently capture sophisticated hierarchies describing the raw
data [8]. In particular, Convolutional Neural Networks
(CNNs) have shown superior performance on standard object
recognition tasks [9], [10], [11], [12], [13]; they effectively
learn complicated mappings while utilizing minimal domain
knowledge.

However, the immediate adoption of CNNs for online visual
tracking is not straightforward. First of all, a CNN requires
a large number of training samples, which are often not
available in visual tracking, as only a few reliable positive
instances can be extracted from the initial frames.
Moreover, a CNN tends to overfit to the most recent
observations (e.g., the most recent instance dominating the
model), which may result in drift. Besides, CNN training is
computationally intensive for online visual tracking. Due to
these difficulties, CNNs have so far been treated in tracking
applications as an offline feature extraction step on predefined
datasets [14], [15].

In this work, we propose a novel tracking algorithm using
a CNN to automatically learn the most useful feature
representations of particular target objects while overcoming the
above challenges. We employ a tracking-by-detection strategy: a
four-layer CNN model distinguishes the target object from
its surrounding background. Our CNN generates scores for all
possible hypotheses of the object locations (object states) in
a given frame. The hypothesis with the highest score is then
selected as the prediction of the object state in the current
frame. We update this CNN model in a purely online manner.
In other words, the proposed tracker is learned based only on
the video frames of the object of interest; no extra information
or offline training is required.

Typically, tracking-by-detection approaches rely on predefined
heuristics to sample from the estimated object location
to construct a set of positive and negative samples. Often
these samples have binary labels, which leads to a few
positive samples and a large negative training set. However,
it is well known that CNN training without any pre-learned
model usually requires a large number of training samples,
both positive and negative. Furthermore, even
with sufficient samples, the learner usually needs hundreds of
seconds to achieve a CNN model with an acceptable accuracy.
This slow updating speed could prevent the CNN model from
being a practical visual tracker. To address these two issues,
our CNN model employs a special type of loss function
that consists of a structural term and a truncated norm. The
structural term makes it possible to obtain a large number
of training samples that have different significance levels,
while taking the uncertainty of the object location into
account. The truncated norm is applied on the CNN response to
reduce the number of samples in the back-propagation [9],
[10] stage and thus significantly accelerate the training process.

We employ the Stochastic Gradient Descent (SGD) method
to optimize the parameters in the CNN model. Since the
standard SGD algorithm is not tailored for online visual
tracking, we propose the following two modifications. First,
to prevent the CNN model from overfitting to occasionally
detected false positive instances, we introduce a temporal
sampling mechanism to the batch generation in the SGD
algorithm. This temporal sampling mechanism assumes that the
object patches shall stay longer in the memory than those of
the background. Therefore, we store all the observed image
patches in a training sample pool, and we choose the positive
samples from a temporal range longer than that of the negative
ones. In practice, we found this to be a key factor in the
robust CNN-based tracker, because the discriminative sampling
strategy successfully regularizes the training of an effective
appearance model. Second, the object locations, except the one
on the first frame, are not always reliable, as they are
estimated by the visual tracker and the uncertainty is
unavoidable [16]. One can treat this difficulty as a label noise
problem [17], [18], [19].
We propose to sample the training data from the joint distribution
over the temporal variable (frame index) and the sample class.
Here we compute the conditional probability of the sample
class, given the frame index, based on a novel measurement
of the tracking quality in that frame. In the experiments, further
performance improvement is observed when the sample class
probability is taken into account.

To achieve a high generalization ability in various image
conditions, we use multiple image cues (low-level image
features, such as normalized gray-scale image and image
gradient) as independent channels of the network input. We update
the CNN parameters by iteratively training each channel
independently, followed by a joint training on a fusion layer
which replaces the last fully-connected layers from multiple
channels. The training processes of the independent channels
and the fusion layer are totally decoupled. This makes the
training efficient, and empirically we observed that this
two-stage iterative procedure is more accurate than jointly
training for all cues.

Finally, we propose to update the CNN model in a “lazy”
style. First, the CNN model is only updated when a significant
appearance change occurs on the object. The intuition behind
this lazy updating strategy is that the object appearance is
more consistent over the video than the background
appearance. Second, the fusion layer is updated
in a coordinate-descent style and with a lower learning rate.
The underlying assumption is that the feature representations
can be updated fast while the contribution ratios of different
image cues are more stable over all the frames. In practice, this
lazy updating strategy not only increases the tracking speed
significantly but also yields an observable accuracy increase.

To summarize, our main contributions include: 

• A visual tracker based on an online-adapting CNN is
proposed. As far as we are aware, this is the first time a
single CNN is introduced for learning the best features
for object tracking in an online manner.

• A structural and truncated loss function is exploited for
the online CNN tracker. This enables us to achieve very
reliable (best reported results in the literature) and robust
tracking at speeds of up to 4 fps.

• An iterative SGD method with a robust temporal
sampling mechanism is introduced for competently capturing
object appearance changes while accounting for
the label noise.

Our experiments on two recently proposed benchmarks
involving over 60 videos demonstrate that our method
outperforms all the compared state-of-the-art algorithms and rarely
loses track of the objects. In addition, it achieves a
practical tracking speed (from 1.5 fps to 4 fps depending on
the sequence and settings), which is comparable to many other
visual trackers.

II. CNN Architecture 
A. CNN with multiple image cues 

Our CNN consists of two convolutional layers and two
fully-connected layers. The ReLU (Rectified Linear Unit)
[?] is adopted as the activation function and max-pooling
operators are used for dimension reduction. The dark gray
block in Fig. 1 shows the structure of our network, which can
be expressed as (32 × 32) → (10 × 10 × 12) → (2 × 2 × 18) →
(8) → (2) in conventional neural network notation.

The input consists of locally normalized 32 × 32 image patches,
a size that balances representation power against
computational load. The first convolutional layer contains
12 kernels, each of size 13 × 13 (an empirical trade-off
between overfitting due to a very large number of kernels and
discrimination power), followed by a pooling operation that
reduces the obtained feature maps (filter responses) to a lower
dimension. The second layer contains 216 kernels of size
7 × 7. After the pooling operation in this layer, this leads
to a 72-dimensional feature vector in the second convolutional
layer.
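The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes stride-1 convolutions without padding and non-overlapping 2 × 2 pooling, which reproduce the quoted map sizes but are not stated explicitly in the text. The function names are hypothetical.

```python
def conv_out(size, kernel):
    """Side length after a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1


def pool_out(size, window):
    """Side length after non-overlapping pooling (assumed)."""
    return size // window


def channel_shapes(input_size=32, k1=13, maps1=12, k2=7, maps2=18, pool=2):
    """Feature-map shapes of one CNN channel as (height, width, maps)."""
    c1 = conv_out(input_size, k1)   # 32 - 13 + 1 = 20
    p1 = pool_out(c1, pool)         # 20 / 2 = 10
    c2 = conv_out(p1, k2)           # 10 - 7 + 1 = 4
    p2 = pool_out(c2, pool)         # 4 / 2 = 2
    return {
        "conv1": (c1, c1, maps1),
        "pool1": (p1, p1, maps1),
        "conv2": (c2, c2, maps2),
        "pool2": (p2, p2, maps2),
        "feature_dim": p2 * p2 * maps2,    # length of the flattened vector
        "layer2_kernels": maps1 * maps2,   # one kernel per input/output map pair
    }


shapes = channel_shapes()
print(shapes["pool1"], shapes["pool2"], shapes["feature_dim"])
```

Under these assumptions, the 216 second-layer kernels correspond to one 7 × 7 kernel for each pair of the 12 input maps and 18 output maps.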

The two fully-connected layers first map the 72-D vector
into an 8-D vector and then generate a 2-D confidence vector
$\mathbf{s} = [s_1, s_2]^T \in \mathbb{R}^2$, with $s_1$ and $s_2$
corresponding to the positive score and negative score,
respectively. In order to increase the margin between the scores
of the positive and negative samples, we calculate the CNN score
of the patch $n$ as

$$S(x_n; \Omega) = S_n = s_1 \cdot \exp(s_1 - s_2), \qquad (1)$$

where $x_n$ denotes the input and the CNN is parameterized by
the weights $\Omega$.
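As a small numerical illustration of Eq. 1, the sketch below computes the score from the two network outputs. The function name `cnn_score` is hypothetical; only the formula itself comes from the text.

```python
import math


def cnn_score(s1, s2):
    """CNN score of Eq. 1: the positive output s1, amplified by exp(s1 - s2)."""
    return s1 * math.exp(s1 - s2)


# A confident positive output is boosted, an ambiguous one is not.
confident = cnn_score(0.9, 0.1)
ambiguous = cnn_score(0.5, 0.5)
print(confident > ambiguous)
```

The exponential factor grows with the gap between the positive and negative outputs, which is how the score widens the margin between confident and ambiguous patches.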

Effective object tracking requires multiple cues, which may
include color, image gradients and different pixel-wise filter
responses. These cues are weakly correlated yet contain
complementary information. Local contrast normalized cues have
previously been shown [?] to produce accurate object detection
and recognition results within CNN frameworks. The
normalization not only alleviates the saturation problem but
also makes the CNN robust to illumination change, which is
desirable during tracking. In this work, we use 3 image cues
generated from the given gray-scale image, i.e., two locally
normalized images with different parameter configurations¹
and a gradient image. For color images, the first two cues are
simply replaced with the H and V channels of the HSV color
representation. Given multiple image cues, we let the CNN
select the most informative ones in a data-driven fashion.
By concatenating the final responses of these 3 cues, we build
a fusion layer (the green dashed block in Fig. 1) to generate
a 2-D output vector, based on which the final CNN score is
calculated using Eq. 1.

[Figure 1 panels: current frame; training patches; normalized patches for Cue-1, Cue-2 and Cue-3; input patch 32 × 32; Layer-1: 12, 20 × 20; Layer-2: 18, 4 × 4; Layer-3: 18, 2 × 2 → 72; Layer-4: 8; feature vector 24 × 1; label 2 × 1; learned filters in Layer-1.]

Fig. 1. The architecture of our CNN tracker with multiple image cues. The gray dashed blocks are the independent CNN channels for different image cues;
the green dashed block is the fusion layer where a linear mapping $\mathbb{R}^{24} \to \mathbb{R}^{2}$ is learned.

In our previous work [20], [21], we proposed to use a set of
CNNs [20], or a single CNN [21] with multiple (4) image cues
for visual tracking. In this work, we employ a more complex
CNN model (as described above) but fewer image cues to
strike a balance between robustness and tracking speed.
Other small yet important modifications from the previous
model include:

• To better curb overfitting, all the training samples are
flipped as augmented data.

• The pixel values of each image cue are normalized to
the range [0, 10]. We found this normalization to be crucial
for balancing the importance of the different image
cues.
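The two modifications can be sketched on a patch stored as a list of rows. This is only an illustration: the text does not say in which direction patches are flipped or how the range [0, 10] is reached, so the horizontal flip and the min-max rescaling below are assumptions, and the function names are hypothetical.

```python
def flip_horizontal(patch):
    """Mirror a patch left-to-right (the flip direction is an assumption)."""
    return [row[::-1] for row in patch]


def rescale(patch, lo=0.0, hi=10.0):
    """Min-max rescale pixel values to [lo, hi] (the method is an assumption)."""
    flat = [v for row in patch for v in row]
    mn, mx = min(flat), max(flat)
    if mx == mn:                      # constant patch: map everything to lo
        return [[lo for _ in row] for row in patch]
    scale = (hi - lo) / (mx - mn)
    return [[lo + (v - mn) * scale for v in row] for row in patch]


cue = [[1, 3], [5, 9]]
augmented = [rescale(cue), rescale(flip_horizontal(cue))]
print(augmented)
```

Bringing every cue to the same numeric range keeps any single cue from dominating the fused response merely because of its scale.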

B. Structural and truncated loss function 

1) Structural loss: Let $x_n$ and $\mathbf{l}_n \in \{[0,1]^T, [1,0]^T\}$
denote the cue of the input patch and its ground-truth label
(background or foreground), respectively, and let $f(x_n; \Omega)$ be the
predicted score of $x_n$ with network weights $\Omega$. The objective
function of $N$ samples in the batch is

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left\| f(x_n; \Omega) - \mathbf{l}_n \right\|_2 \qquad (2)$$

when the CNN is trained in batch mode. Eq. 2 is a
commonly used loss function and performs well in binary
classification problems. However, for object localization tasks,
higher performance can usually be obtained by ‘structurizing’
the binary classifier. The advantage of employing the
structural loss is the larger number of available training samples,
which is crucial to CNN training. In the ordinary
binary-classification setting, one can only use the training samples
with high confidence to avoid class ambiguity. In contrast, the
structural CNN is learned based upon all the sampled patches.

1 Two parameters $r_\mu$ and $r_\sigma$ determine a local contrast normalization
process. In this work, we use two configurations, i.e., $\{r_\mu = 8, r_\sigma = 8\}$
and $\{r_\mu = 12, r_\sigma = 12\}$, respectively.

We modify the original CNN’s output to
$f(\phi(\Gamma, y_n); \Omega) \in \mathbb{R}^2$, where $\Gamma$ is the
current frame, $y_n \in \mathbb{R}^o$ is the motion
parameter vector of the target object, which determines the
object’s location in $\Gamma$, and $o$ is the number of degrees of
freedom² of the transformation. The operation $\phi(\Gamma, y_n)$
crops the features from $\Gamma$ using the motion $y_n$. The
associated structural loss is defined as

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left[ \Delta(y_n, y^*) \cdot \left\| f(\phi(\Gamma, y_n); \Omega) - \mathbf{l}_n \right\|_2 \right], \qquad (3)$$

where $y^*$ is the (estimated) motion state of the target object
in the current frame. To define $\Delta(y_n, y^*)$ we first calculate
the overlapping score $\Theta(y_n, y^*)$ [22] as

$$\Theta(y_n, y^*) = \frac{\mathrm{area}(r(y_n) \cap r(y^*))}{\mathrm{area}(r(y_n) \cup r(y^*))}, \qquad (4)$$

where $r(y)$ is the region defined by $y$, and $\cap$ and $\cup$ denote the
intersection and union operations, respectively. Finally we have

$$\Delta(y_n, y^*) = \left| \frac{2}{1 + \exp(-(\Theta(y_n, y^*) - 0.5))} - 1 \right| \in [0, 1], \qquad (5)$$

and the sample label $\mathbf{l}_n$ is set as

$$\mathbf{l}_n = \begin{cases} [1, 0]^T & \text{if } \Theta(y_n, y^*) > 0.5 \\ [0, 1]^T & \text{otherwise.} \end{cases}$$

From Eq. 5 we can see that $\Delta(y_n, y^*)$ actually measures the
importance of the training patch $n$. For instance, patches that
are very close to the object center and those reasonably far from
it may play more significant roles in training the CNN, while the
patches in between are less important.

2 In this paper $o = 3$, i.e., the bounding box changes in its location and
scale.
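The overlap score, the importance weight and the label rule can be sketched together. This is a minimal illustration that assumes axis-aligned boxes given as (x, y, width, height) and assumes the importance weight has the form |2 / (1 + exp(-(Θ - 0.5))) - 1|, which matches the description that patches in between matter least. The function names are hypothetical.

```python
import math


def overlap(a, b):
    """Overlapping score of Eq. 4 for boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def importance(theta):
    """Importance weight (assumed form of Eq. 5): zero at theta = 0.5."""
    return abs(2.0 / (1.0 + math.exp(-(theta - 0.5))) - 1.0)


def label(theta):
    """Two-element label: (1, 0) for foreground, (0, 1) for background."""
    return (1, 0) if theta > 0.5 else (0, 1)


target = (0, 0, 10, 10)
for candidate in [(0, 0, 10, 10), (3, 0, 10, 10), (9, 9, 10, 10)]:
    theta = overlap(candidate, target)
    print(candidate, round(theta, 3), round(importance(theta), 3), label(theta))
```

Candidates whose overlap is near 0.5 receive a weight close to zero, so ambiguous patches contribute little to the loss while still being part of the training set.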

In visual tracking, when a new frame $\Gamma_{(t)}$ comes, we predict
the object motion state $y^*_{(t)}$ as

$$y^*_{(t)} = \arg\max_{y_n \in \mathcal{Y}} f(\phi(\Gamma_{(t)}, y_n); \Omega), \qquad (6)$$

where $\mathcal{Y}$ contains all the test patches in the current frame.

2) Truncated structural loss: Ordinary CNN models regress
the input features into the target labels via the $\ell_2$-norm loss.
One can directly adopt this strategy in the CNN-based tracking
algorithm. However, to speed up the online training process,
we employ a truncated $\ell_2$-norm in our model. We empirically
observe that patches with very small errors do not contribute
much to the back-propagation. Therefore, we can approximate
the loss by counting only the patches with errors that are larger
than a threshold. Motivated by this, in [21], we define a
truncated $\ell_2$-norm as

$$\|e\|_T = \|e\|_2 \cdot \left(1 - \mathbb{1}\left[\|e\|_2 \le \beta\right]\right), \qquad (7)$$

where $\mathbb{1}[\cdot]$ denotes the indicator function while $e$ is the
prediction error, e.g., $f(\phi(\Gamma, y_n); \Omega) - \mathbf{l}_n$ for patch $n$. In our
previous work [21], this truncated loss did increase the training
speed, but at the cost of reducing the prediction accuracy.
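A minimal sketch of the truncated norm of Eq. 7 is given below. The helper names are hypothetical and the threshold is passed in explicitly; the value 0.0025 used in the example is the one reported for $\beta$ in this paper.

```python
import math


def l2_norm(error):
    """Euclidean norm of a prediction error vector."""
    return math.sqrt(sum(v * v for v in error))


def truncated_norm(error, beta):
    """Truncated norm of Eq. 7: errors with norm <= beta count as zero."""
    norm = l2_norm(error)
    return 0.0 if norm <= beta else norm


def active_samples(errors, beta):
    """Indices of the samples that still contribute to back-propagation."""
    return [i for i, e in enumerate(errors) if truncated_norm(e, beta) > 0.0]


errors = [(0.001, 0.0), (0.3, 0.4), (0.0, 0.002)]
print(active_samples(errors, beta=0.0025))
```

Only the samples returned by `active_samples` need a backward pass, which is where the speed-up comes from.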

In this work, we observed that the tracking performance is
more sensitive to the prediction error on positive samples than
on negative samples. Recall that in the training stage, we label
each positive sample as $[1, 0]^T$ and each negative sample as
$[0, 1]^T$. In the test stage, the visual tracker selects the best
particle among the ones with high scores. If the highest score
in the current frame is large enough, the negative samples
with small errors, which are ignored in training according to
the truncated loss, will not affect the prediction. In contrast, if
one ignores the positive samples with small errors in training,
the selection among the top-n particles in the test stage will
consequently be inaccurate, and drift could occur. In other
words, we need a more precise loss function
for positive samples in visual tracking. We thus improve the
original truncated loss function as:


$$\|e\|_T = \|e\|_2 \cdot \left(1 - \mathbb{1}\left[\|e\|_2 \le \frac{\beta}{1 + u \cdot l_n}\right]\right), \qquad (8)$$

where $u > 0$ and $l_n = \mathbf{l}_n(1)$, i.e., the scalar label of the n-th
sample. This truncated norm is visualized in Fig. 2 and now
Eq. 3 becomes:

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left[ \Delta(y_n, y^*) \cdot \left\| f(\phi(\Gamma, y_n); \Omega) - \mathbf{l}_n \right\|_T \right]. \qquad (9)$$

It is easy to see that with the truncated norm $\|\cdot\|_T$, the
backpropagation [9] process only depends on the training
samples with large errors, i.e., $\| f(\phi(\Gamma, y_n); \Omega) - \mathbf{l}_n \|_T > 0$.
Accordingly, we can ignore the samples with small errors and
the backpropagation procedure is significantly accelerated. In
this work, we use $\beta = 0.0025$ and $u = 3$.

Fig. 2. The truncated $\ell_2$ losses. The dashed green curve indicates the
original $\ell_2$ loss, the red and blue curves are the truncated losses for positive
and negative samples.


III. Optimization of CNN for Tracking
A. Online Learning: Iterative SGD with Temporal Sampling

1) Temporal Sampling: Following other CNN-based
approaches [?], we use Stochastic Gradient Descent (SGD)
for learning the parameters $\Omega$. However, the SGD we
employ is specifically tailored for visual tracking.

Different from detection and recognition tasks, the training
sample pool in visual tracking grows gradually as new frames
come in. Moreover, it is desirable to learn a consistent object
model over all the previous frames and then use it to
distinguish the object from the background in the current frame.
This implies that we can effectively learn a discriminative
model on a long-term positive set and a short-term negative
set.

Based on this intuition, we tailor the SGD method
by embedding in it a temporal sampling process. In
particular, given that the positive sample pool³ is
$\mathbb{Y}^+_{1:t} = \{y^+_{1,(1)}, y^+_{2,(1)}, \dots, y^+_{N-1,(t)}, y^+_{N,(t)}\}$
and the negative sample pool is
$\mathbb{Y}^-_{1:t} = \{y^-_{1,(1)}, y^-_{2,(1)}, \dots, y^-_{N-1,(t)}, y^-_{N,(t)}\}$, when
generating a mini-batch for SGD, we sample the positive pool
with the probability

$$\mathrm{Prob}(y^+_{n,(t')}) = \frac{1}{tN}, \qquad (10)$$

while we sample the negative pool with the probability

$$\mathrm{Prob}(y^-_{n,(t')}) = \frac{1}{Z} \exp\left[-\sigma (t - t')^2\right], \qquad (11)$$

where $\frac{1}{Z}$ is the normalization term and we use $\sigma = 10$ in this
work.
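The two temporal distributions can be sketched at the level of frames. The sketch below is illustrative only: it assumes the positive distribution is uniform over all stored frames and that negatives follow the Gaussian-shaped decay of Eq. 11, and the function names are hypothetical.

```python
import math
import random


def positive_frame_weights(t):
    """Uniform probability over frames 1..t (long-term positive set, assumed)."""
    return [1.0 / t] * t


def negative_frame_weights(t, sigma=10.0):
    """Normalized exp(-sigma * (t - t')^2) over frames t' = 1..t (Eq. 11)."""
    raw = [math.exp(-sigma * (t - tp) ** 2) for tp in range(1, t + 1)]
    z = sum(raw)                       # normalization term Z
    return [w / z for w in raw]


def sample_frame(weights, rng=random):
    """Draw a 1-based frame index according to the given frame weights."""
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]


rng = random.Random(0)
pos = positive_frame_weights(5)
neg = negative_frame_weights(5)
print([round(w, 6) for w in neg])
print(sample_frame(pos, rng), sample_frame(neg, rng))
```

With $\sigma = 10$ almost all of the negative probability mass falls on the current frame, while positives are drawn evenly from the whole history; this is the long-term positive and short-term negative behaviour described above.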

In a way, the above temporal selection mechanism can
be considered similar to the “multiple-lifespan” data
sampling [23]. However, [23] builds three different codebooks,
each corresponding to a different lifespan, while we learn
one discriminative model based on two different sampling
distributions.

3 Here we slightly abuse the notation of y, which denotes the motion state 
in the previous section. Here y indicates the cropped image patch according 
to the motion state. 









2) Robust Temporal Sampling with Label Noise: In most
tracking-by-detection strategies, the detected object $y^*_{(t)}$ is
treated as a true positive in the following training stage.
However, among all the motion states $y^*_{(t)}, \forall t = 1, 2, \dots, T$,
only the first one, $y^*_{(1)}$, is always reliable, as it is manually
defined. Other motion states are estimated based on the
previous observations. Thus, the uncertainty of the prediction
$y^*_{(t)}, \forall t > 1$ is usually unavoidable [16]. Recall that the
structural loss defined in Eq. 4 could change significantly if a
minor perturbation is imposed on $y^*_{(t)}$; one therefore requires
an accurate $y^*_{(t)}$ in every frame, which is, unfortunately, not
feasible.

In our previous work [21], we took the uncertainty into
account by imposing a robust term on the loss function (Eq. 9). The
robust term is designed following the principle of
Multiple-Instance Learning (MIL) [24], [25] and it alleviates
overfitting in some scenarios. However, the positive-sample bag
could also reduce the learning effectiveness, as it will confuse
the learner when two distinct samples are involved in one bag.
In fact, other MIL-based trackers also suffer from this problem [?].

In this work, we propose a much simpler scheme for
addressing the issue of prediction uncertainty. Specifically,
the prediction uncertainty is cast as a label noise problem
[17], [18], [19]. We assume there exist some frames on
which the detected “objects” are false-positive samples. In
other words, some sample labels in $\mathbb{Y}^+_{1:t}$ and $\mathbb{Y}^-_{1:t}$ are
contaminated (flipped in the binary case). In the context of
temporal sampling, the assumption introduces an extra random
variable $\eta$ which represents the event that the label is true
($\eta = 1$) or not ($\eta = 0$). The sampling process is now conducted
in the joint probability space $\{n = 1, 2, \dots, N\} \times \{t' =
1, 2, \dots, t\} \times \{\eta = 1, 0\}$ and the joint probability is

$$\mathrm{Prob}(y^{\pm}_{n,(t')}, \eta = 1), \qquad (12)$$

where $y^{\pm}_{n,(t')}$ stands for the selection of the n-th
positive/negative sample in the t'-th frame. According to the chain
rule, we have

$$\begin{aligned}
\mathrm{Prob}(y^{\pm}_{n,(t')}, \eta = 1) &= \mathrm{Prob}(t', n, \eta = 1) \\
&= \mathrm{Prob}(\eta = 1 \mid t', n) \cdot \mathrm{Prob}(t', n) \\
&= \mathrm{Prob}(\eta = 1 \mid t', n) \cdot \mathrm{Prob}(y^{\pm}_{n,(t')}),
\end{aligned} \qquad (13)$$

where $\mathrm{Prob}(y^{\pm}_{n,(t')})$ is given in Eq. 10 and Eq. 11, while the
conditional probability $\mathrm{Prob}(\eta = 1 \mid t', n)$ reflects the
likelihood that the label of sample $y^{\pm}_{n,(t')}$ is not contaminated.
To estimate $\mathrm{Prob}(\eta = 1 \mid t', n)$ efficiently, we assume
that in the same frame, the conditional probabilities are equal
for $\forall n = 1, 2, \dots, N$. Then we propose to calculate the
probability as a prediction quality $Q_{t'}$ in frame $t'$, i.e.,

$$Q_{t'} = \mathrm{Prob}(\eta = 1 \mid t', n) = 1 - \frac{1}{|P|} \sum_{n \in P} \left[ \Delta(y_{n,(t')}, y^*_{(t')}) \cdot \left\| f(\phi(\Gamma, y_{n,(t')}); \Omega) - \mathbf{l}_n \right\|_T \right], \qquad (14)$$

where the set $P$ contains the samples in frame $t'$ with high
scores. Mathematically, it is defined as

$$P = \{\forall n \mid S_{n,(t')} > v \cdot S^*_{(t')}\}, \qquad (15)$$

where $S_{n,(t')}$ and $S^*_{(t')}$ are the CNN scores (see Eq. 1) of
the n-th sample and the sample selected as the object in frame
$t'$, respectively. The underlying assumption of Eq. 14 is that
a detection heat-map with multiple widely distributed peaks
usually implies low detection quality, as there is only ONE
target in the video sequence. This tracking quality is illustrated
in Fig. 3. From the figure we can see that when occlusion
(middle) or significant appearance change (right) occurs, the
tracking quality drops dramatically and thus the samples in
those “contaminated” frames are rarely selected according to
Eq. 13.

3) Iterative Stochastic Gradient Descent (IT-SGD): Recall
that we use multiple image cues as the input of the
CNN tracker. This leads to a CNN with higher complexity,
which implies a low training speed and a high possibility
of overfitting. Noting that the image cues are only
weakly correlated, we train the network in an iterative
manner. In particular, we define the model parameters as
$\Omega = \{w^1_{cov}, \dots, w^K_{cov}, w^1_{fc}, \dots, w^K_{fc}, w_{fuse}\}$,
where $w^k_{cov}$ denotes the filter parameters in cue-$k$, $w^k_{fc}$
corresponds to the fully-connected layers and $w_{fuse}$ parameterizes
the fusion layer.

In this work, we conduct the SGD process iteratively over
different image cues and the fusion layer. Specifically, after we
complete the training on $w^k_{cov}$ and $w^k_{fc}$, we evaluate the filter
responses from that cue in the last fully-connected layer and
then update $w_{fuse}$ on the dimensions corresponding to cue-$k$.
This can be regarded as a coordinate-descent variation of SGD.
In practice, we found that both the robust temporal sampling
mechanism and the IT-SGD significantly curb the overfitting
problem. The iterative SGD is illustrated in Algorithm 1.


Algorithm 1 Iterative SGD with robust temporal sampling

Inputs: Frame image $\Gamma_{(t)}$; two sample pools $\mathbb{Y}^+_{1:t}$, $\mathbb{Y}^-_{1:t}$;
    old CNN model ($K$ cues) $f_0(\phi(\Gamma_{(t)}, \cdot); \Omega)$;
    estimated/given $y^*_{(t)}$;
    learning rates $r$, $\hat{r}$; minimal loss $\epsilon$; training step budget $M$.
procedure IT-SGD($\mathbb{Y}^+_{1:t}$, $\mathbb{Y}^-_{1:t}$, $f$, $y^*$, $\hat{r}$, $r$, $M$)
    Select samples $\{y_{1,(t)}, y_{2,(t)}, \dots, y_{N,(t)}\}$.
    Generate associated labels $\mathbf{l}_{1,(t)}, \dots, \mathbf{l}_{N,(t)}$ according to $y^*_{(t)}$.
    Estimate the prediction quality $Q_t$.
    Save the current samples and labels into $\mathbb{Y}^+_{1:t}$ and $\mathbb{Y}^-_{1:t}$.
    Sample training instances according to $\mathrm{Prob}(y^{\pm}_{n,(t')}, \eta = 1)$.
    for $m \leftarrow 0, M - 1$ do
        $\mathcal{L}_m = \frac{1}{N} \sum_{n=1}^{N} [\Delta(y_n, y^*) \cdot \| f_m(\phi(\Gamma_{(t)}, y_n); \Omega) - \mathbf{l}_n \|_T]$;
        If $\mathcal{L}_m \le \epsilon$, break;
        $k = \mathrm{mod}(m, K) + 1$;
        Update $w^k_{cov}$ and $w^k_{fc}$ using SGD with learning rate $r$.
        Update $w_{fuse}$ partially for cue-$k$, with learning rate $\hat{r}$.
        Save $f_{m+1} = f_m$;
    end for
end procedure
Outputs: New CNN model $f^* = f_{m^*}$, $m^* = \arg\min_m \mathcal{L}_m$.


Fig. 3. A demonstration of the prediction quality on three different frames in the sequence tiger1. For each frame, the overlaying heat-map indicates the
distribution of the high-score particles while the blue box is the detected $y^*_{(t')}$. The tracking qualities are shown on the top of the frame images. Note that
the quality is estimated without ground-truth information.
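The prediction quality can be sketched as one minus the average weighted loss over the high-score samples of a frame. The sketch below is illustrative only: the rule used to pick the high-score set (scores above a fixed fraction of the best score), the fraction 0.8, the clamping to [0, 1] and all function names are assumptions made for the example, not values taken from the paper.

```python
def high_score_set(scores, ratio=0.8):
    """Indices whose score exceeds `ratio` times the best score (assumed rule)."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if s > ratio * best]


def prediction_quality(scores, weighted_losses, ratio=0.8):
    """Quality of a frame: 1 - mean weighted loss over the high-score set."""
    selected = high_score_set(scores, ratio)
    if not selected:                 # no confident sample: treat frame as unreliable
        return 0.0
    mean_loss = sum(weighted_losses[i] for i in selected) / len(selected)
    # Clamping keeps the value usable as a probability (added safeguard, assumed).
    return min(1.0, max(0.0, 1.0 - mean_loss))


def joint_probability(prior, quality):
    """Probability of drawing a sample and its label being clean."""
    return quality * prior


# One clear peak: the high-score samples all have small losses.
single_peak = prediction_quality([1.0, 0.9, 0.2], [0.1, 0.3, 0.9])
# Several distant peaks: high-score samples far from the detection have large losses.
multi_peak = prediction_quality([1.0, 0.95, 0.9], [0.1, 0.9, 0.9])
print(single_peak, multi_peak, joint_probability(0.01, single_peak))
```

A frame whose heat-map has several distant high-score peaks receives a lower quality, so its samples are drawn less often when mini-batches are generated.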


B. Lazy Update and the Overall Work Flow 

It is straightforward to update the CNN model using the
IT-SGD algorithm at each frame. However, this could be
computationally expensive, as the complexity of the training
process would dominate the complexity of the whole algorithm.
On the other hand, if the appearance of the object is
not always changing, a well-learned appearance model can
remain discriminative for a long time. Furthermore, when the
feature representations are updated to adapt to appearance
changes, the contribution ratios of different image cues could
remain more stable over all the frames.

Motivated by the above two intuitions, we propose to update
the CNN model in a lazy manner. First, when tracking the
object, we only update the CNN model when the training loss
$\mathcal{L}_1$ is above $2\epsilon$. Once the training starts, the training goal is
to reduce $\mathcal{L}$ below $\epsilon$. As a result, usually $\mathcal{L}_1 \le 2\epsilon$ holds
in a number of the following frames, and thus no training is
required for those frames. This way, we accelerate the tracking
algorithm significantly (Fig. 4). Second, we update the fusion
layer in a lazy, i.e., a coordinate-descent manner with a small
learning rate (see Algorithm 1). The learning process is thus
stabilized well. In this work, we set $\epsilon$ = 5e-3, $r$ = 5e-2
and $\hat{r}$ = 5e-3.
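The control flow of the lazy update can be sketched independently of the network. In the sketch below, `evaluate_loss` and `train_step` are hypothetical callbacks standing in for the loss computation and one SGD pass on a single cue (plus its part of the fusion layer), and the step budget of 50 is an illustrative assumption.

```python
def lazy_update(evaluate_loss, train_step, num_cues, eps=5e-3, budget=50):
    """Train only when the loss exceeds 2*eps; stop once it drops to eps.

    Returns the list of 1-based cue indices that were trained, in order.
    """
    trained = []
    if evaluate_loss() <= 2.0 * eps:
        return trained                     # model still fits: skip training
    for m in range(budget):
        if evaluate_loss() <= eps:
            break                          # training goal reached
        k = m % num_cues + 1               # cycle through the cues in turn
        train_step(k)
        trained.append(k)
    return trained


# Toy stand-in for a model whose loss halves after every training step.
state = {"loss": 0.03}
order = lazy_update(lambda: state["loss"],
                    lambda k: state.update(loss=state["loss"] * 0.5),
                    num_cues=3)
print(order, state["loss"])
```

Because training starts at $2\epsilon$ but stops at $\epsilon$, the gap between the two thresholds gives the model some slack, and several subsequent frames can usually be processed without any update.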

IV. Experiments 

A. Benchmarks and experiment setting 

We evaluate our method on two recently proposed visual
tracking benchmarks, i.e., the CVPR2013 Visual Tracker
Benchmark [26] and the VOT2013 Challenge Benchmark
[27]. These two benchmarks contain more than 60 sequences
and cover almost all the challenging scenarios such as scale
changes, illumination changes, occlusions, cluttered
backgrounds and motion blur. Furthermore, these two benchmarks
evaluate tracking algorithms with different measures and
criteria, which can be used to analyze the tracker from different
views.

In the experiments on the two selected benchmarks, we use the
same parameter values for DeepTrack. Most parameters of the
CNN tracker are given in Sec. II and Sec. III. In addition, there
are some motion parameters for sampling the image patches.
In this work, we only consider the displacement $\Delta x$, $\Delta y$ and
the relative scale $s$ of the object⁴. In a new frame, we sample
1500 random patches from a Gaussian distribution centered
on the previously predicted state. The standard deviations for the
three dimensions are min(10, 0.5·h), min(10, 0.5·h) and
0.01·h, respectively. Note that all parameters are fixed for all
videos in both benchmarks; no parameter tuning is performed for
any specific video sequence. We run our algorithm in Matlab
with unoptimized code mixed with CUDA-PTX kernels for
the CNN implementation. The hardware environment includes
one quad-core CPU and an NVIDIA GTX980 GPU.
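The candidate sampling step can be sketched with the standard library alone. In this illustration the state is represented as a tuple (x, y, s) holding a position and a relative scale; that representation, the independence of the three dimensions and the function name are assumptions made for the example.

```python
import random


def sample_candidates(prev_state, height, count=1500, rng=random):
    """Draw candidate states around the previously predicted state.

    prev_state: (x, y, s) position and relative scale (assumed representation).
    height:     object height h, which sets the standard deviations.
    """
    x, y, s = prev_state
    sd_xy = min(10.0, 0.5 * height)    # displacement std in x and y
    sd_s = 0.01 * height               # relative scale std
    return [(rng.gauss(x, sd_xy), rng.gauss(y, sd_xy), rng.gauss(s, sd_s))
            for _ in range(count)]


rng = random.Random(0)
candidates = sample_candidates((100.0, 50.0, 1.25), height=40.0, rng=rng)
print(len(candidates), candidates[0])
```

Each candidate state is then cropped from the frame and scored by the CNN, and the highest-scoring one becomes the new prediction.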

B. Comparison results on the CVPR2013 benchmark 

The CVPR2013 Visual Tracker Benchmark [26] contains
50 fully annotated sequences. These sequences include many
popular sequences used in the online tracking literature over
the past several years. For better evaluation and analysis of the
strengths and weaknesses of tracking approaches, these sequences
are annotated with 11 attributes: illumination
variation, scale variation, occlusion, deformation, motion blur,
fast motion, in-plane rotation, out-of-plane rotation,
out-of-view, background clutter, and low resolution. The benchmark
contains the results of 29 tracking algorithms published before
the year 2013. Here, we compare our method with 11 other
tracking methods. Among the competitors, TGPR [28] and
KCF [29] are the most recent state-of-the-art visual trackers;
TLD [?], VTD [?], CXT [?], ASLA [?], Struck [?] and SCM
[?] are the top-6 methods as reported in the benchmark; CPF
[?], IVT [?] and MIL [?] are classical tracking methods
which are used as comparison baselines.

The tracking results are evaluated via the following two
measurements: 1) Tracking Precision (TP), the percentage
of the frames whose estimated location is within the given
distance threshold ($\tau_d$) to the ground truth, and 2) Tracking
Success Rate (TSR), the percentage of the frames in
which the overlapping score defined in Eq. 4 between the
estimated location and the ground truth is larger than a
given overlapping threshold ($\tau_o$). Following the setting in the
recently published work [28], [29], we conduct the experiment
using the OPE (one-pass evaluation) strategy for a
better comparison to the latest methods.
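Both measurements reduce to counting frames. The sketch below assumes that the estimated location is compared with the ground truth through the Euclidean distance between box centers in pixels and that the per-frame overlap scores have already been computed; the function names and the toy values are hypothetical.

```python
import math


def tracking_precision(pred_centers, gt_centers, tau_d):
    """Fraction of frames whose center error is within tau_d (assumed in pixels)."""
    hits = sum(1 for (px, py), (gx, gy) in zip(pred_centers, gt_centers)
               if math.hypot(px - gx, py - gy) <= tau_d)
    return hits / len(gt_centers)


def tracking_success_rate(overlaps, tau_o):
    """Fraction of frames whose overlap score is larger than tau_o."""
    return sum(1 for o in overlaps if o > tau_o) / len(overlaps)


pred = [(10.0, 10.0), (40.0, 10.0), (12.0, 9.0)]
truth = [(10.0, 10.0), (10.0, 10.0), (10.0, 10.0)]
print(tracking_precision(pred, truth, tau_d=20.0))
print(tracking_success_rate([0.7, 0.5, 0.61, 0.6], tau_o=0.6))
```

Sweeping the thresholds instead of fixing them turns each measurement into a curve, which gives a fuller picture than a single operating point.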

4 $s = h/32$, where $h$ is the object’s height.

Fig. 4. Work flow of the proposed algorithm. The bottom row shows the three-stage operations on a frame: test, estimation and training. In the training frames,
the green bounding boxes are the negative samples while the red ones denote the positive samples. The dashed block covers the positive sample pool $\mathbb{Y}^+$
(red) and the negative sample pool $\mathbb{Y}^-$ (green). In each pool, the edges of the sample patches indicate their sampling importance. The green ones (negative) and
red ones (positive) represent the prior probabilities of sample selection while the purple ones stand for the conditional probabilities ($Q_{(t)}$). The thicker the
edge, the higher the probability.


First, we evaluate all algorithms using fixed thresholds, 
i.e., $\tau_d = 20$ and $\tau_o = 0.6$, which is a standard setting in 
tracking evaluations [26]. Results for all the involved trackers 
and all the video sequences are given in Table I. According 
to the table, our method achieves better average performance 
than the other trackers. For the TP measure, the gap between 
our method and the best reported result in the literature is 6 
percentage points: our method achieves 83% accuracy while 
the best state-of-the-art result is 77% (TGPR). For the TSR 
measure, our method is 8 percentage points better than the 
existing methods: it reaches 63% accuracy while the best 
state-of-the-art result is 55% (SCM). Furthermore, our CNN tracker 
ranks as the best method 33 times; the corresponding numbers 
for TGPR, KCF, SCM and Struck are 21, 28, 19 and 21, 
respectively. Another observation from Table I is that 
DeepTrack rarely performs inaccurately: there are only 36 
occasions on which the proposed tracker performs significantly 
worse than the best method (i.e., scores less than 80% of the 
highest score for that sequence). 
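The "best" and "bad" counts can be illustrated on a single row of Table I. The sketch below is an assumption about how the counting works, based on the rule stated in the table caption (a score is "bad" when it is lower than 80% of the row maximum); the helper name is hypothetical and ties for the maximum are all counted as best.

```python
def best_and_bad(scores, ratio=0.8):
    """Split one table row into best and bad trackers.

    scores: dict mapping tracker name -> score for one sequence and one measure.
    Returns (trackers with the top score, trackers below ratio * top score).
    """
    top = max(scores.values())
    best = [name for name, s in scores.items() if s == top]
    bad = [name for name, s in scores.items() if s < ratio * top]
    return best, bad

# TP scores of the tiger1 row in Table I.
tiger1_tp = {
    "Struck": 0.17, "MIL": 0.09, "VTD": 0.12, "CXT": 0.37,
    "SCM": 0.13, "TLD": 0.46, "ASLA": 0.23, "IVT": 0.08,
    "CPF": 0.39, "KCF": 0.97, "TGPR": 0.28, "DeepTrack": 0.56,
}

best, bad = best_and_bad(tiger1_tp)
print(best)      # trackers with the highest TP on tiger1
print(len(bad))  # number of trackers below 80% of that score
```

Summing such per-row results over both measures and all rows would give the "No. Best" and "No. Bad" entries for each tracker.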

In fact, the superiority of our method becomes clearer 
when the tracking results are evaluated using different measurement 
criteria (different $\tau_d$, $\tau_o$). Specifically, for TP we evaluate 
the trackers with the thresholds $\tau_d = 1, 2, \cdots, 50$, while for 
TSR we use the thresholds $\tau_o = 0$ to $1$ in steps of 0.05. 
Accordingly, we generate the precision curves and the success-rate 
curves for each tracking method, which are shown in Fig. 5. 
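A minimal sketch of how such curves could be produced from per-frame measurements is given below. It assumes the per-frame location errors and overlap scores have already been computed; the function names and the comparison conventions (error within the threshold, overlap strictly larger than the threshold) are assumptions that follow the definitions of TP and TSR.

```python
import numpy as np

def precision_curve(errors, thresholds=np.arange(1, 51)):
    """TP for each distance threshold tau_d = 1, 2, ..., 50."""
    errors = np.asarray(errors, float)
    return np.array([(errors <= t).mean() for t in thresholds])

def success_curve(overlaps, thresholds=np.linspace(0.0, 1.0, 21)):
    """TSR for each overlap threshold tau_o = 0, 0.05, ..., 1."""
    overlaps = np.asarray(overlaps, float)
    return np.array([(overlaps > t).mean() for t in thresholds])
```

Plotting `precision_curve(...)` against the distance thresholds and `success_curve(...)` against the overlap thresholds gives one curve per tracker in each plot.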

From the score plots we can see that, overall, the CNN 
tracker ranks first (red curves) for both the TP and TSR 
evaluations. The proposed DeepTrack method outperforms all 
the other trackers when $\tau_o < 0.68$ and $\tau_d > 10$. When the 
evaluation threshold is reasonably loose (i.e., $\tau_o < 0.45$ and 
$\tau_d > 20$), our algorithm is very robust, with both accuracies 
higher than 80%. That said, when the 
thresholds are tight (e.g., $\tau_o > 0.75$ or $\tau_d < 5$), our tracker 
performs similarly to the rest of the trackers we tested. 

In many applications, it is more important not to lose 
the target object than to locate its bounding box very accurately. 
As the plots show, our tracker rarely loses the object: it achieves 
accuracies around 90% when $\tau_o < 0.3$ and $\tau_d > 30$. 

Fig. 6 shows the performance plots for 11 kinds of difficulties 
in visual tracking, i.e., fast motion, background clutter, 
motion blur, deformation, illumination variation, in-plane 
rotation, low resolution, occlusion, out-of-plane rotation, 
out-of-view and scale variation. We can see that the proposed 
DeepTrack outperforms the other competitors for all the 
difficulties except the “out-of-view” category. 

C. Comparison results on the VOT2013 benchmark 

The VOT2013 Challenge Benchmark [27] provides an evaluation 
kit and a dataset with 16 fully annotated sequences 
for evaluating tracking algorithms in realistic scenes subject to 
various common conditions. The tracking performance in the 
VOT2013 Challenge Benchmark is primarily evaluated with 
two criteria: accuracy and robustness. The accuracy 
measure is the average of the overlap ratios over the valid 
frames of each sequence, while the tracking robustness is the 
average number of failures over 15 runs. A tracking failure 
happens once the overlap ratio drops to zero; the tracker is 
then re-initialized in the failure frame so that it can continue. 
According to the evaluation protocol, three 
types of experiments are conducted. In Experiment-1, the 
tracker is run on each sequence in the dataset 15 times by 
| Sequence | Struck | MIL | VTD | CXT | SCM | TLD | ASLA | IVT | CPF | KCF | TGPR | DeepTrack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tiger1 | 0.17/0.13 | 0.09/0.07 | 0.12/0.09 | 0.37/0.17 | 0.13/0.11 | 0.46/0.36 | 0.23/0.15 | 0.08/0.07 | 0.39/0.24 | 0.97/0.94 | 0.28/0.22 | 0.56/0.36 |
| carDark | 1.00/1.00 | 0.38/0.09 | 0.74/0.66 | 0.73/0.67 | 1.00/0.98 | 0.64/0.50 | 1.00/0.99 | 0.81/0.69 | 0.17/0.02 | 1.00/0.44 | 1.00/0.95 | 1.00/0.97 |
| girl | 1.00/0.90 | 0.71/0.25 | 0.95/0.41 | 0.77/0.61 | 1.00/0.74 | 0.92/0.61 | 1.00/0.78 | 0.44/0.17 | 0.74/0.40 | 0.86/0.47 | 0.92/0.69 | 0.98/0.83 |
| david | 0.33/0.19 | 0.70/0.05 | 0.94/0.38 | 1.00/0.48 | 1.00/0.84 | 1.00/0.83 | 1.00/0.94 | 1.00/0.65 | 0.19/0.02 | 1.00/0.26 | 0.98/0.26 | 1.00/0.76 |
| singer1 | 0.64/0.20 | 0.50/0.20 | 1.00/0.36 | 0.97/0.27 | 1.00/1.00 | 1.00/0.93 | 1.00/0.98 | 0.96/0.35 | 0.99/0.10 | 0.81/0.20 | 0.68/0.19 | 1.00/1.00 |
| skating1 | 0.47/0.20 | 0.13/0.08 | 0.90/0.43 | 0.23/0.06 | 0.77/0.21 | 0.32/0.21 | 0.77/0.45 | 0.11/0.05 | 0.23/0.17 | 1.00/0.23 | 0.81/0.25 | 1.00/0.45 |
| deer | 1.00/0.94 | 0.13/0.07 | 0.04/0.03 | 1.00/0.87 | 0.03/0.03 | 0.73/0.73 | 0.03/0.03 | 0.03/0.03 | 0.04/0.03 | 0.82/0.76 | 0.86/0.79 | 1.00/0.99 |
| singer2 | 0.04/0.03 | 0.40/0.27 | 0.45/0.43 | 0.06/0.04 | 0.11/0.13 | 0.07/0.05 | 0.04/0.03 | 0.04/0.04 | 0.12/0.09 | 0.95/0.89 | 0.97/0.91 | 0.57/0.34 |
| car4 | 0.99/0.26 | 0.35/0.23 | 0.36/0.32 | 0.38/0.27 | 0.97/0.93 | 0.87/0.63 | 1.00/0.95 | 1.00/1.00 | 0.14/0.01 | 0.95/0.24 | 1.00/0.28 | 1.00/1.00 |
| tiger2 | 0.63/0.42 | 0.41/0.23 | 0.16/0.08 | 0.34/0.16 | 0.11/0.05 | 0.39/0.04 | 0.14/0.11 | 0.08/0.05 | 0.11/0.04 | 0.36/0.28 | 0.72/0.47 | 0.49/0.32 |
| dudek | 0.90/0.81 | 0.69/0.76 | 0.88/0.96 | 0.82/0.87 | 0.88/0.86 | 0.60/0.63 | 0.75/0.74 | 0.89/0.88 | 0.57/0.58 | 0.88/0.82 | 0.75/0.71 | 0.73/0.81 |
| Sylvester | 0.99/0.85 | 0.65/0.46 | 0.82/0.74 | 0.85/0.56 | 0.95/0.77 | 0.95/0.80 | 0.82/0.65 | 0.68/0.63 | 0.86/0.52 | 0.84/0.73 | 0.96/0.93 | 1.00/0.92 |
| jumping | 1.00/0.50 | 1.00/0.33 | 0.21/0.08 | 1.00/0.25 | 0.15/0.11 | 1.00/0.70 | 0.45/0.15 | 0.21/0.08 | 0.16/0.09 | 0.34/0.26 | 0.95/0.50 | 1.00/0.93 |
| david2 | 1.00/1.00 | 0.98/0.24 | 1.00/0.88 | 1.00/1.00 | 1.00/0.80 | 1.00/0.70 | 1.00/0.95 | 1.00/0.74 | 1.00/0.25 | 1.00/1.00 | 1.00/0.97 | 1.00/0.87 |
| shaking | 0.19/0.04 | 0.28/0.18 | 0.93/0.83 | 0.13/0.04 | 0.81/0.69 | 0.41/0.31 | 0.48/0.17 | 0.01/0.01 | 0.17/0.07 | 0.02/0.01 | 0.97/0.70 | 0.95/0.68 |
| trellis | 0.88/0.72 | 0.23/0.16 | 0.50/0.44 | 0.97/0.69 | 0.87/0.84 | 0.53/0.45 | 0.86/0.85 | 0.33/0.26 | 0.30/0.14 | 1.00/0.74 | 0.98/0.68 | 1.00/0.96 |
| woman | 1.00/0.89 | 0.21/0.18 | 0.20/0.16 | 0.37/0.15 | 0.94/0.69 | 0.19/0.15 | 0.20/0.17 | 0.20/0.17 | 0.20/0.05 | 0.94/0.90 | 0.97/0.87 | 0.98/0.24 |
| fish | 1.00/1.00 | 0.39/0.28 | 0.65/0.57 | 1.00/1.00 | 0.86/0.85 | 1.00/0.96 | 1.00/1.00 | 1.00/1.00 | 0.11/0.08 | 1.00/1.00 | 0.97/0.97 | 1.00/1.00 |
| matrix | 0.12/0.12 | 0.18/0.10 | 0.22/0.03 | 0.06/0.01 | 0.35/0.24 | 0.16/0.03 | 0.05/0.01 | 0.02/0.02 | 0.09/0.02 | 0.17/0.11 | 0.39/0.26 | 0.72/0.43 |
| ironman | 0.11/0.02 | 0.11/0.02 | 0.17/0.12 | 0.04/0.03 | 0.16/0.09 | 0.12/0.04 | 0.13/0.08 | 0.05/0.05 | 0.05/0.04 | 0.22/0.10 | 0.22/0.13 | 0.08/0.05 |
| mhyang | 1.00/0.97 | 0.46/0.25 | 1.00/0.77 | 1.00/1.00 | 1.00/0.96 | 0.98/0.52 | 1.00/1.00 | 1.00/1.00 | 0.79/0.08 | 1.00/0.93 | 0.95/0.88 | 1.00/0.96 |
| liquor | 0.39/0.40 | 0.20/0.20 | 0.52/0.52 | 0.21/0.21 | 0.28/0.29 | 0.59/0.54 | 0.23/0.23 | 0.21/0.21 | 0.52/0.53 | 0.98/0.97 | 0.27/0.27 | 0.91/0.89 |
| motorRolling | 0.09/0.09 | 0.04/0.06 | 0.05/0.05 | 0.04/0.02 | 0.04/0.05 | 0.12/0.10 | 0.06/0.07 | 0.03/0.04 | 0.06/0.04 | 0.05/0.05 | 0.09/0.10 | 0.80/0.43 |
| coke | 0.95/0.87 | 0.15/0.08 | 0.15/0.11 | 0.65/0.15 | 0.43/0.24 | 0.68/0.09 | 0.16/0.10 | 0.13/0.13 | 0.39/0.03 | 0.84/0.41 | 0.95/0.63 | 0.91/0.18 |
| soccer | 0.25/0.15 | 0.19/0.14 | 0.45/0.18 | 0.23/0.12 | 0.27/0.16 | 0.11/0.11 | 0.12/0.11 | 0.17/0.14 | 0.26/0.16 | 0.79/0.35 | 0.16/0.13 | 0.30/0.16 |
| boy | 1.00/0.93 | 0.85/0.29 | 0.97/0.61 | 0.94/0.42 | 0.44/0.44 | 1.00/0.74 | 0.44/0.44 | 0.33/0.31 | 1.00/0.82 | 1.00/0.96 | 0.99/0.91 | 1.00/0.93 |
| basketball | 0.12/0.09 | 0.28/0.20 | 1.00/0.85 | 0.04/0.02 | 0.66/0.53 | 0.03/0.02 | 0.60/0.26 | 0.50/0.08 | 0.74/0.54 | 0.92/0.71 | 0.99/0.69 | 0.82/0.39 |
| lemming | 0.63/0.49 | 0.82/0.68 | 0.51/0.42 | 0.73/0.38 | 0.17/0.16 | 0.86/0.43 | 0.17/0.17 | 0.17/0.17 | 0.88/0.40 | 0.49/0.30 | 0.35/0.26 | 0.28/0.26 |
| bolt | 0.02/0.01 | 0.01/0.01 | 0.31/0.14 | 0.03/0.01 | 0.03/0.01 | 0.31/0.08 | 0.02/0.01 | 0.01/0.01 | 0.91/0.15 | 0.99/0.75 | 0.02/0.01 | 0.99/0.78 |
| crossing | 1.00/0.72 | 1.00/0.83 | 0.44/0.36 | 0.62/0.32 | 1.00/0.99 | 0.62/0.41 | 1.00/0.99 | 1.00/0.23 | 0.89/0.38 | 1.00/0.78 | 1.00/0.81 | 0.94/0.56 |
| couple | 0.74/0.51 | 0.68/0.61 | 0.11/0.06 | 0.64/0.52 | 0.11/0.11 | 1.00/0.98 | 0.09/0.09 | 0.09/0.09 | 0.87/0.58 | 0.26/0.24 | 0.60/0.35 | 0.99/0.63 |
| david3 | 0.34/0.34 | 0.74/0.60 | 0.56/0.44 | 0.15/0.10 | 0.50/0.47 | 0.11/0.10 | 0.55/0.49 | 0.75/0.41 | 0.57/0.33 | 1.00/0.96 | 1.00/0.69 | 1.00/0.93 |
| carScale | 0.65/0.37 | 0.63/0.35 | 0.55/0.42 | 0.74/0.74 | 0.65/0.64 | 0.85/0.29 | 0.74/0.65 | 0.78/0.67 | 0.67/0.32 | 0.81/0.35 | 0.79/0.37 | 0.67/0.56 |
| doll | 0.92/0.34 | 0.73/0.20 | 0.97/0.73 | 0.99/0.87 | 0.98/0.97 | 0.98/0.39 | 0.92/0.91 | 0.76/0.27 | 0.94/0.84 | 0.97/0.33 | 0.94/0.40 | 0.96/0.86 |
| skiing | 0.04/0.04 | 0.07/0.06 | 0.14/0.01 | 0.15/0.06 | 0.14/0.06 | 0.12/0.05 | 0.14/0.11 | 0.11/0.09 | 0.06/0.01 | 0.07/0.05 | 0.12/0.10 | 0.09/0.06 |
| football | 0.75/0.57 | 0.79/0.67 | 0.80/0.65 | 0.80/0.57 | 0.77/0.42 | 0.80/0.28 | 0.73/0.62 | 0.79/0.61 | 0.97/0.60 | 0.80/0.57 | 1.00/0.75 | 0.79/0.52 |
| football1 | 1.00/0.72 | 1.00/0.55 | 0.99/0.51 | 1.00/0.96 | 0.57/0.34 | 0.55/0.34 | 0.80/0.39 | 0.81/0.49 | 1.00/0.58 | 0.96/0.80 | 0.99/0.41 | 1.00/0.38 |
| freeman1 | 0.80/0.16 | 0.94/0.12 | 0.95/0.13 | 0.73/0.18 | 0.98/0.54 | 0.54/0.18 | 0.39/0.20 | 0.81/0.26 | 0.76/0.18 | 0.39/0.13 | 0.93/0.21 | 1.00/0.35 |
| freeman3 | 0.79/0.12 | 0.05/0.00 | 0.72/0.22 | 1.00/0.89 | 1.00/0.88 | 0.77/0.42 | 1.00/0.90 | 0.76/0.33 | 0.17/0.14 | 0.91/0.21 | 0.77/0.15 | 0.97/0.67 |
| freeman4 | 0.37/0.15 | 0.20/0.02 | 0.37/0.08 | 0.43/0.17 | 0.51/0.18 | 0.41/0.24 | 0.22/0.16 | 0.35/0.17 | 0.12/0.02 | 0.53/0.12 | 0.58/0.21 | 0.71/0.22 |
| subway | 0.98/0.63 | 0.99/0.68 | 0.23/0.18 | 0.26/0.20 | 1.00/0.90 | 0.25/0.22 | 0.23/0.21 | 0.22/0.19 | 0.22/0.10 | 1.00/0.94 | 1.00/0.99 | 1.00/0.79 |
| SUV | 0.57/0.57 | 0.12/0.12 | 0.52/0.47 | 0.91/0.90 | 0.98/0.80 | 0.91/0.70 | 0.57/0.55 | 0.45/0.44 | 0.78/0.63 | 0.98/0.98 | 0.66/0.66 | 0.52/0.52 |
| walking | 1.00/0.42 | 1.00/0.37 | 1.00/0.55 | 0.24/0.22 | 1.00/0.86 | 0.96/0.30 | 1.00/0.99 | 1.00/0.98 | 1.00/0.65 | 1.00/0.34 | 1.00/0.41 | 1.00/0.94 |
| walking2 | 0.98/0.32 | 0.41/0.31 | 0.41/0.39 | 0.41/0.39 | 1.00/0.99 | 0.43/0.29 | 0.40/0.40 | 1.00/0.99 | 0.36/0.35 | 0.44/0.30 | 0.99/0.31 | 0.61/0.38 |
| mountainBike | 0.92/0.67 | 0.67/0.41 | 1.00/0.81 | 0.28/0.28 | 0.97/0.72 | 0.26/0.21 | 0.90/0.82 | 1.00/0.84 | 0.15/0.06 | 1.00/0.88 | 1.00/0.87 | 1.00/0.91 |
| faceocc1 | 0.58/0.95 | 0.22/0.46 | 0.53/0.72 | 0.34/0.57 | 0.93/1.00 | 0.20/0.65 | 0.18/0.25 | 0.64/0.87 | 0.32/0.41 | 0.73/0.99 | 0.66/0.80 | 0.33/0.42 |
| jogging-1 | 0.24/0.22 | 0.23/0.21 | 0.23/0.18 | 0.96/0.95 | 0.23/0.21 | 0.97/0.95 | 0.23/0.22 | 0.22/0.22 | 0.54/0.23 | 0.23/0.22 | 0.99/0.96 | 0.97/0.94 |
| jogging-2 | 0.25/0.22 | 0.19/0.16 | 0.19/0.16 | 0.16/0.15 | 1.00/0.98 | 0.86/0.83 | 0.18/0.17 | 0.20/0.19 | 0.84/0.72 | 0.16/0.15 | 1.00/0.95 | 0.99/0.30 |
| dog1 | 1.00/0.51 | 0.92/0.45 | 0.83/0.61 | 1.00/0.95 | 0.98/0.76 | 1.00/0.61 | 1.00/0.87 | 0.98/0.80 | 0.91/0.90 | 1.00/0.51 | 1.00/0.52 | 1.00/0.95 |
| fleetface | 0.64/0.51 | 0.36/0.32 | 0.66/0.68 | 0.57/0.60 | 0.53/0.58 | 0.51/0.41 | 0.30/0.32 | 0.26/0.24 | 0.16/0.21 | 0.46/0.47 | 0.45/0.47 | 0.51/0.60 |
| faceocc2 | 1.00/0.97 | 0.74/0.62 | 0.98/0.84 | 1.00/0.90 | 0.86/0.74 | 0.86/0.51 | 0.79/0.61 | 0.99/0.77 | 0.40/0.29 | 0.97/0.79 | 0.47/0.45 | 1.00/0.71 |
| Overall | 0.66/0.48 | 0.47/0.28 | 0.58/0.41 | 0.58/0.43 | 0.65/0.55 | 0.61/0.42 | 0.53/0.46 | 0.50/0.38 | 0.49/0.28 | 0.74/0.53 | 0.77/0.54 | 0.83/0.63 |
| No. Best | 21 | 4 | 10 | 16 | 19 | 11 | 18 | 11 | 4 | 28 | 21 | 33 |
| No. Bad | 62 | 89 | 71 | 66 | 51 | 72 | 64 | 74 | 84 | 48 | 45 | 36 |

TABLE I 

The tracking scores of DeepTrack and other visual trackers on the CVPR2013 benchmark. The reported results are shown in 
the order of “TP/TSR”. The top scores are shown in red for each row. A score is shown in blue if it is higher than 80% of the 
highest value in that row. The “No. Best” row shows the number of best scores for each tracking algorithm while the “No. Bad” row 
shows the number of low scores, i.e., the scores lower than 80% of the maximum one in the corresponding row. 


initializing it on the ground truth bounding box. The setting 
of Experiment-2 is the same as that of Experiment-1, except that the 
initial bounding box is randomly perturbed on the order of ten 
percent of the object size. In Experiment-3, the color frames 
are converted into grayscale images. 
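The accuracy and robustness criteria described for this benchmark can be sketched as follows. This is a simplified illustration and not the official evaluation kit: it assumes that each run is available as a list of per-frame overlap ratios recorded under the reset protocol, treats every frame with zero overlap as one failure, and treats all remaining frames as valid. The official kit may handle the frames around a re-initialization differently. The function name is hypothetical.

```python
def vot_accuracy_robustness(runs):
    """Score one sequence from repeated runs.

    runs: list of runs; each run is a list of per-frame overlap ratios in
    [0, 1]. A frame with overlap 0 is counted as a failure, after which the
    tracker is assumed to have been re-initialized.
    Returns (accuracy, robustness).
    """
    failures_per_run = [sum(1 for o in run if o == 0.0) for run in runs]
    valid_overlaps = [o for run in runs for o in run if o > 0.0]
    accuracy = sum(valid_overlaps) / len(valid_overlaps)  # mean overlap on valid frames
    robustness = sum(failures_per_run) / len(runs)        # mean failures per run
    return accuracy, robustness

# Two hypothetical runs on a four-frame sequence.
acc, rob = vot_accuracy_robustness([[0.8, 0.6, 0.0, 0.7],
                                    [0.5, 0.5, 0.5, 0.5]])
```

In the benchmark protocol the list would contain 15 runs per sequence; two short runs are used here only to keep the example readable.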

First, we follow the evaluation protocol to test our method 
and compare it with the 27 other tracking algorithms provided on the 
benchmark website. The main comparison results can be 
found in Table II and Fig. 7. We can see that, on average, 
the proposed method ranks first in both the accuracy and the 
robustness comparisons. Specifically, DeepTrack achieves the 
best robustness scores for all the scenarios and ranks 
second in accuracy for all the experimental settings. In 
Fig. 7, one can observe that the red circles (which stand 
for DeepTrack) are always located in the top-right corner of the 
plot. This observation is consistent with the scores reported in 
Table II. From the results we can see that DeepTrack 
achieves performance close to, but consistently better than, that of the 
PLT method [27]. Other tracking methods that achieve 
similar performance on this benchmark are FoT [36], EDFT 
[37] and LGT++ [38]. 

Note that the scores listed in Table II and the plots in Fig. 7 
are rank-based, which differs from the measuring criteria 
used in the CVPR2013 benchmark. It is well known that the 
evaluation method for visual trackers is not unique and can 
become sophisticated for a specific objective [39]. Different 
tracking measures usually offer different points of view for assessing 
a tracking method. The best performance on the VOT2013 
benchmark confirms the superiority of DeepTrack from another 
perspective. 
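To make the notion of a rank-based score concrete, the sketch below averages per-sequence ranks over a set of trackers, so that a lower value is better. It is a simplification under stated assumptions: ties share the mean of the tied ranks, and any additional processing performed by the official evaluation kit is not modelled. The function name and the toy scores are hypothetical.

```python
def average_ranks(scores, higher_is_better=True):
    """Average per-sequence rank of each tracker (1 = best).

    scores: dict mapping tracker name -> list of per-sequence scores,
    all lists having the same length. Tied trackers share the mean rank.
    """
    names = list(scores)
    n_seq = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for i in range(n_seq):
        column = {name: scores[name][i] for name in names}
        for name in names:
            value = column[name]
            if higher_is_better:
                better = sum(1 for other in names if column[other] > value)
            else:
                better = sum(1 for other in names if column[other] < value)
            tied = sum(1 for other in names if column[other] == value) - 1
            totals[name] += 1 + better + tied / 2.0
    return {name: totals[name] / n_seq for name in names}

# Toy accuracy scores for three trackers on two sequences.
ranks = average_ranks({"A": [0.9, 0.5], "B": [0.8, 0.5], "C": [0.1, 0.2]})
```

For a robustness measure based on failure counts, `higher_is_better=False` would be used so that fewer failures yield a better rank.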

In [28], the authors evaluate their TGPR tracker on the 
VOT2013 benchmark without comparing it with other trackers. 
Here we compare DeepTrack with the TGPR algorithm, 
which was recently proposed and achieves state-of-the-art 
performance on the CVPR2013 benchmark. Following the settings 
in [28], we run the proposed tracker in Experiment-1 
and Experiment-2. The performance comparison is shown in 
Table III. 
Fig. 5. The Precision Plot (left) and the Success Plot (right) of the tracking results on the CVPR2013 benchmark. Note that the color of one curve is 
determined by the rank of the corresponding tracker, not its name. 



| | bicycle | bolt | car | cup | david | diving | face | gym | hand | iceskater | juice | jump | singer | sunshade | torus | woman | overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Exp1-TGPR-Rob. | 0 | 1.27 | 0.40 | 0 | 0.27 | 2.87 | 0 | 2.87 | 1.67 | 0 | 0 | 0 | 0.60 | 0.20 | 0.13 | 1.00 | 0.71 |
| Exp1-DeepTrack-Rob. | 0.47 | 0.07 | 0.47 | 0 | 0.20 | 0.80 | 0 | 0.73 | 0.20 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.47 | 0.22 |
| Exp1-TGPR-Accu. | 0.60 | 0.57 | 0.45 | 0.83 | 0.58 | 0.33 | 0.85 | 0.57 | 0.56 | 0.60 | 0.76 | 0.59 | 0.65 | 0.73 | 0.78 | 0.74 | 0.64 |
| Exp1-DeepTrack-Accu. | 0.58 | 0.61 | 0.51 | 0.86 | 0.54 | 0.35 | 0.73 | 0.49 | 0.54 | 0.61 | 0.81 | 0.66 | 0.51 | 0.72 | 0.76 | 0.60 | 0.62 |
| Exp2-TGPR-Rob. | 0 | 1.27 | 0.20 | 0 | 0.27 | 2.87 | 0.07 | 3.00 | 2.07 | 0 | 0 | 0 | 0.33 | 0.07 | 0.60 | 1.00 | 0.73 |
| Exp2-DeepTrack-Rob. | 0.27 | 0 | 0.33 | 0 | 0.20 | 0.80 | 0 | 0.27 | 0.60 | 0 | 0 | 0 | 0 | 0.07 | 0.27 | 0.67 | 0.22 |
| Exp2-TGPR-Accu. | 0.57 | 0.57 | 0.41 | 0.75 | 0.58 | 0.32 | 0.77 | 0.53 | 0.53 | 0.57 | 0.73 | 0.57 | 0.45 | 0.64 | 0.65 | 0.67 | 0.58 |
| Exp2-DeepTrack-Accu. | 0.54 | 0.62 | 0.49 | 0.77 | 0.50 | 0.36 | 0.70 | 0.47 | 0.53 | 0.59 | 0.75 | 0.62 | 0.60 | 0.69 | 0.69 | 0.56 | 0.59 |

TABLE III 

The performance comparison between the DeepTrack tracker and the TGPR tracker on the VOT2013 benchmark. The better 
robustness score is shown in bold. Note that for accuracy (Accu.), the comparison is not fair if the robustness score is different, 
and thus no bold accuracy score is shown. 


We can see that the proposed DeepTrack outperforms the 
TGPR tracker in the robustness evaluation, with a clear 
performance gap. In Experiment-1, the TGPR tracker needs to be 
re-initialized 0.71 times per sequence on average, while that number 
for our method is only 0.22. Similarly, with the bounding 
box perturbation (Experiment-2), TGPR needs 0.73 
re-initializations while DeepTrack still requires only 0.22. Note 
that in Table III the accuracies of different trackers are not 
directly comparable, as they are calculated under different 
re-initialization conditions. However, by observing the overall 
scores, we can still conclude that DeepTrack 
is more robust than TGPR, as it achieves similar accuracies 
to TGPR (0.62 vs. 0.64 for Experiment-1 and 0.59 vs. 0.58 
for Experiment-2) while requiring only around one third of the 
re-initializations. 
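The "around one third" figure follows directly from the overall robustness scores in Table III; as a quick check of the arithmetic:

```latex
\[
\frac{0.22}{0.71} \approx 0.31 \quad (\text{Experiment-1}), \qquad
\frac{0.22}{0.73} \approx 0.30 \quad (\text{Experiment-2}).
\]
```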

D. Verification for the structural loss and the robust temporal 
sampling 

Here we verify the three proposed modifications to the 
CNN model. We rerun the experiment on the CVPR2013 
benchmark using DeepTrack with each modification 
deactivated in turn. Specifically, the temporal sampling mechanism, the 
label uncertainty and the structural loss are each disabled, and the 
resulting tracking results are shown in Fig. 8, compared with 
the full version of the proposed method. In addition, the results 
of two state-of-the-art methods, i.e., Struck and TGPR, are also 
shown as references. 

From the figure we can see that the structural loss, the 
temporal sampling mechanism and the label uncertainty all 
contribute to the success of our CNN tracker. In particular, the 
temporal sampling plays a more important role. The structural 
loss increases the TP accuracy by 10%, and taking the label 
noise into consideration lifts the TP accuracy by 4%. 
Generally speaking, the curve consistently goes 
down when one component is removed from the original 
DeepTrack model, which indicates the validity of the proposed 
modifications. 


E. Tracking speed analysis 

We report the average speed (in fps) of the proposed 
DeepTrack method in Table IV, compared with that of DeepTrack 
without the truncated loss. Note that there are two kinds of 
average speed scores: the average fps over all the sequences and 
the average fps over all the frames. The latter reduces the 
influence of short sequences, where the initialization process 
usually dominates the computational burden. 
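The difference between the two averages can be illustrated with a short sketch. It assumes that the sequence average is the mean of the per-sequence fps values and that the frame average is the total number of frames divided by the total processing time; the function name and the timings below are hypothetical and are not measurements from the experiments.

```python
def average_speeds(sequences):
    """Compute both kinds of average speed.

    sequences: list of (n_frames, seconds) pairs, one per sequence.
    Returns (sequence_average_fps, frame_average_fps).
    """
    per_sequence_fps = [n / t for n, t in sequences]
    sequence_average = sum(per_sequence_fps) / len(per_sequence_fps)
    total_frames = sum(n for n, _ in sequences)
    total_seconds = sum(t for _, t in sequences)
    frame_average = total_frames / total_seconds
    return sequence_average, frame_average

# A short, slow sequence (initialization-dominated) and a long, faster one.
seq_avg, frame_avg = average_speeds([(100, 100.0), (1000, 400.0)])
```

In this toy case the short sequence pulls the sequence average down to 1.75 fps, while the frame average, weighted by sequence length, is 2.2 fps.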

According to the table, the truncated loss boosts the tracking 
efficiency by around 37%. Furthermore, our method tracks the 
object at an average speed of around 2.5fps. Considering that 
the speed of TGPR is around 3fps [28] and for the Sparse 
Fig. 6. The Precision Plot (left) and the Success Plot (right) of the tracking results on the CVPR2013 benchmark, for 11 kinds of tracking difficulties. 
Panels include the precision and success plots of OPE for fast motion (17), illumination variation (25), low resolution (4), out-of-plane rotation (39), 
scale variation (28), background clutter (21), deformation (19), occlusion (29), out of view (6) and in-plane rotation (31). The horizontal axis of the 
precision plots is the location error threshold. 
Fig. 7. The Precision Plot (left) and the Success Plot (right). The color of one curve is determined by the rank of the corresponding trackers, not their 
names. Panels: ranking plots for the experiments baseline, region_noise and grayscale. 


Fig. 8. The Precision Plot (left) and the Success Plot (right) of the results obtained by using different versions of DeepTrack. Note that the color of one 
curve is determined by the rank of the corresponding tracker, not its name. 


Representation based methods the speeds are usually lower 
than 2.5fps [23], we can conclude that 
DeepTrack achieves a speed comparable to the state-of-the-art 
methods. 
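As a rough check against the rounded entries of Table IV, the relative speed gain from the truncated loss can be computed for both averaging schemes:

```latex
\[
\frac{1.96}{1.49} \approx 1.32 \quad (\text{Sequence Average}), \qquad
\frac{2.52}{1.86} \approx 1.35 \quad (\text{Frame Average}).
\]
```

With the two-decimal values printed in the table, the gain is thus roughly 32% to 35%, in the same range as the figure quoted in the text; the exact percentage depends on the unrounded timings, which are not given here.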

V. Conclusion 

We introduced a CNN-based online object tracker. We 
employed a novel CNN architecture and a structural loss 
function that handles multiple input cues. We also proposed 
to modify the ordinary Stochastic Gradient Descent for 
visual tracking by iteratively updating the parameters and adding 
a robust temporal sampling mechanism to the mini-batch 
generation. This tracking-tailored SGD algorithm increases the 
speed and the robustness of the training process significantly. 
Our experiments demonstrated that the CNN-based DeepTrack 
outperforms state-of-the-art methods on two recently proposed 
benchmarks which contain over 60 video sequences and 
achieves comparable tracking speed. 
| Tracker | Exp-1 Accu. | Exp-1 Rob. | Exp-2 Accu. | Exp-2 Rob. | Exp-3 Accu. | Exp-3 Rob. | Avg. Accu. | Avg. Rob. |
|---|---|---|---|---|---|---|---|---|
| CNN | 9.60 | 7.06 | 10.14 | 6.09 | 8.17 | 6.04 | 9.30 | 6.40 |
| AIF | 10.29 | 14.27 | 11.39 | 14.43 | 9.90 | 17.50 | 10.52 | 15.40 |
| ASAM | … | 14.17 | NaN | NaN | NaN | NaN | NaN | NaN |
| CACTuS-FL | 22.27 | 19.03 | 20.92 | 15.24 | 21.54 | 17.69 | 21.58 | 17.32 |
| CCMS | 9.87 | 11.76 | 8.97 | 10.66 | 12.36 | 16.35 | 10.40 | 12.92 |
| CT | 18.92 | 15.51 | 19.10 | 15.30 | 18.62 | 14.03 | 18.88 | 14.95 |
| DFT | 11.23 | 15.12 | 12.64 | 15.47 | 12.78 | 11.79 | … | 14.13 |
| LGT | 13.27 | 9.44 | 12.37 | 8.08 | 16.43 | 9.07 | 14.02 | 8.86 |
| LT-FLO | … | 16.22 | 11.25 | 14.77 | 10.40 | 14.41 | 10.91 | 15.14 |
| MIL | 16.38 | 14.25 | 16.28 | 13.58 | … | 12.54 | 15.50 | 13.46 |
| MORP | 20.64 | 28.00 | 19.65 | 27.00 | NaN | NaN | NaN | NaN |
| ORIA | 13.13 | 16.69 | 13.86 | 16.15 | 11.97 | 13.85 | 12.99 | 15.56 |
| PJS-S | 12.50 | 15.75 | 12.31 | 15.43 | 11.87 | 14.89 | 12.22 | 15.36 |
| PLT | 10.88 | 7.06 | 10.58 | 6.60 | 8.54 | 6.73 | 10.00 | 6.79 |
| RDET | … | 14.84 | 16.14 | … | 15.97 | 12.00 | … | 13.40 |
| SCTT | 9.36 | 16.16 | 11.37 | 16.43 | … | 15.68 | 9.75 | … |
| STMT | 17.16 | 16.81 | 17.17 | 16.12 | 17.12 | 13.73 | 17.15 | 15.55 |
| Struck | … | 13.69 | 15.21 | 14.02 | 12.33 | 11.85 | 13.82 | 13.19 |
| SwATrack | 13.98 | 15.53 | 13.93 | 14.48 | NaN | NaN | NaN | NaN |
| TLD | 13.12 | 20.44 | 13.37 | 20.12 | 12.37 | 19.00 | 12.95 | 19.85 |

Rows whose surviving values cannot be assigned to columns (values in source order, gaps marked with …): 
EDFT: 11.45, …, 12.72, …, 11.18, … 
FoT: …, 11.23, 13.87, …, 9.44, …, 17.33, … 
LGTpp: … 
GSDT and the rows following it up to MIL: …, 16.36, 11.98, 14.82, 10.73, 15.90, …, 13.94, 14.43, …, 14.91, 14.82, 16.90, 17.64, 17.57, 16.05, 16.46 

TABLE II 

The performance comparison between the CNN tracker and the other 
27 trackers on the VOT2013 benchmark. For each column, the 
best score is shown in bold and red while the second best 
score is shown in blue. 



| | Sequence Average | Frame Average |
|---|---|---|
| With TruncLoss | 1.96fps | 2.52fps |
| No TruncLoss | 1.49fps | 1.86fps |

TABLE IV 

The tracking speed of DeepTrack with or without the 
truncated loss. Note that there are two kinds of 
average speed scores: the average fps over all the sequences 
(Sequence Average) and the average fps over all the frames 
(Frame Average). 